metal : fix fa kernel #9187
Conversation
I cannot reproduce this on M3 Max, but a race condition seems like it would be the most likely cause. If that's the case, it is likely to create more issues in the future, so it would be good to fix it completely rather than making some changes that make the issue less likely, but still possible.
I found another fix with #9188, but I still don't understand why. Another observation is that the race is likely inside this block: llama.cpp/ggml/src/ggml-metal.metal, lines 2142 to 2160 at 7a3df79.
The reason is that if I change the code like this, the problem goes away:

```diff
diff --git a/ggml/src/ggml-metal.metal b/ggml/src/ggml-metal.metal
index aba0b9a0..df1be087 100644
--- a/ggml/src/ggml-metal.metal
+++ b/ggml/src/ggml-metal.metal
@@ -2141,8 +2141,9 @@ kernel void kernel_flash_attn_ext_f16(
                 simdgroup_store(mqk, ss + 8*cc, TF, 0, false);

-                const short tx = tiisg%4;
-                const short ty = tiisg/4;
+                if (tiisg == 0) {
+                for (short ty = 0; ty < 8; ++ty) {
+                for (short tx = 0; tx < 4; ++tx) {

                 // mqk = mqk*scale
                 ss[8*cc + ty*TF + 2*tx + 0] *= scale;
@@ -2157,6 +2158,10 @@ kernel void kernel_flash_attn_ext_f16(
                     // mqk = mqk + mask*slope
                     ss[8*cc + ty*TF + 2*tx + 0] += slope*mp[ic + 8*cc + ty*nb31/sizeof(half) + 2*tx + 0];
                     ss[8*cc + ty*TF + 2*tx + 1] += slope*mp[ic + 8*cc + ty*nb31/sizeof(half) + 2*tx + 1];
+                }
+
+                }
+                }
             }
         }
     }
```

Any ideas what could be the issue? I'm not sure how to debug this further.
I don't know what threads use what data, and I can't test it either. Intuitively I would assume that a barrier is necessary before the softmax to synchronize the shared memory.
It does look like a barrier is necessary, though adding one did not fix the problem. However, I rewrote the code to make the threads access the data more coherently and this resolved the problem: #9189
Superseded by #9189
cont #9159

For some reason, `test-backend-ops` fails on `master` on M2 Ultra with the latest changes from #9159. Maybe a race condition or running out of registers? Not sure.

The proposed change in this PR fixes the tests and the performance remains the same.